New cardinality estimation algorithms for HyperLogLog sketches

نویسنده

  • Otmar Ertl
چکیده

This paper presents new methods to estimate the cardinalities of multisets recorded by HyperLogLog sketches. A theoretically motivated extension to the original estimator is presented that eliminates the bias for small and large cardinalities. Based on the maximum likelihood principle a second unbiased method is derived together with a robust and efficient numerical algorithm to calculate the estimate. The maximum likelihood approach can also be applied to more than a single HyperLogLog sketch. In particular, it is shown that it gives more precise cardinality estimates for union, intersection, or relative complements of two sets that are both represented by HyperLogLog sketches compared to the conventional technique using the inclusion-exclusion principle. All the new methods are demonstrated and verified by extensive simulations.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

New Cardinality Estimation Methods for HyperLogLog Sketches

Œis work presents new cardinality estimation methods for data sets recorded by HyperLogLog sketches. A simple derivation of the original estimator was found, that also gives insight how to correct its de€ciencies. Œe result is an improved estimator that is unbiased over the full cardinality range, is easy computable, and does not rely on empirically determined data as previous approaches. Based...

متن کامل

Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm

We describe a new cardinality estimation algorithm that is extremely space-efficient. It applies one of three novel estimators to the compressed state of the Flajolet-Martin-85 coupon collection process. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, the new algorithm simultaneously wins on all three dimensions of the time/space/accuracy tradeoff. Our proto...

متن کامل

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the...

متن کامل

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

​ —The information presented in this paper defines LogLog-Beta (LogLog-β). LogLog-β is a new algorithm for estimating cardinalities based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities, therefore, it is more efficient and simpler to implement. Our simulations show that the accuracy provided by the new al...

متن کامل

Coherent random permutations with record statistics

157 HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm A two-parameter family of random permutations of [n] is introduced, with distribution conditionally uniform given the counts of upper and lower records. The family interpolates between two versions of Ewens' distribution. A distinguished role of the family is determined by the fact that every sequence of coherent p...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1702.01284  شماره 

صفحات  -

تاریخ انتشار 2017